Abstract

We continue the investigation of problems concerning correlation clustering, or clustering
with qualitative information, a clustering formulation that has been studied
recently (5J 131^113. The basic setup is that we are given as input a complete graph on
n nodes (which correspond to the items to be clustered) whose edges are labeled + (for similar
pairs of items) and − (for dissimilar pairs of items). Thus the input consists only of qualitative
information on similarity, with no quantitative distance measure between items. The quality of
a clustering is measured by its number of agreements, which is simply the number of
edges it correctly classifies: the number of − edges whose endpoints it places
in different clusters plus the number of + edges both of whose endpoints it places within the
same cluster.

In this paper, we study the problem of finding clusterings that maximize the number of
agreements, and the complementary minimization version where we seek clusterings that
minimize the number of disagreements. We focus on the situation when the number of clusters is
stipulated to be a small constant k. Our main result is that for every k, there is a polynomial
time approximation scheme for both maximizing agreements and minimizing disagreements.
(The problems are NP-hard for every k ≥ 2.) The main technical work is for the minimization
version, as the PTAS for maximizing agreements follows along the lines of the property tester
for Max k-CUT from [11].

In contrast, when the number of clusters is not specified, the problem of minimizing
disagreements was shown to be APX-hard [5], even though the maximization version admits a
PTAS.
1 Introduction



In this work, we continue the investigation of problems concerning an appealing formulation of 
clustering called correlation clustering or clustering using qualitative information that has been 
studied recently in several works, including |@][T4l|3l|5l|6ll3. The basic setup here is to cluster 
a collection of n items given as input only qualitative information concerning similarity between 
pairs of items; specifically for every pair of items, we are given a (Boolean) label as to whether 
those items are similar or dissimilar. We are not provided with any quantitative information on 
how different pairs of elements are, as is typically assumed in most clustering formulations. These 
formulations take as input a metric on the items and then aim to optimize some function of the 
pairwise distances of the items within and across clusters. The objective in our formulation is to 
produce a partitioning into clusters that places similar objects in the same cluster and dissimilar 
objects in different clusters, to the extent possible. 

An obvious graph-theoretic formulation of the problem is the following: given a complete 
graph on n nodes with each edge labeled either "+" (similar) or "−" (dissimilar), find a partitioning
of the vertices into clusters that agrees as much as possible with the edge labels. The maximization
version, call it MaxAgree, seeks to maximize the number of agreements: the number of + edges
inside clusters plus the number of − edges across clusters. The minimization version, denoted
MinDisAgree, aims to minimize the number of disagreements: the number of − edges within
clusters plus the number of + edges between clusters. 
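To make the two objectives concrete, the following minimal Python sketch counts the agreements and disagreements of a given clustering. The data representation (a dict `labels` keyed by unordered pairs and a dict `cluster_of`) and the function name are illustrative assumptions, not notation from the paper.

```python
def count_agreements(labels, cluster_of):
    """Return (agreements, disagreements) of a clustering.

    labels:     dict mapping frozenset({u, v}) -> '+' or '-' (assumed encoding)
    cluster_of: dict mapping each node to its cluster id
    """
    agreements = disagreements = 0
    for edge, sign in labels.items():
        u, v = tuple(edge)
        same_cluster = cluster_of[u] == cluster_of[v]
        # A '+' edge agrees when it lies inside a cluster,
        # a '-' edge agrees when it goes across clusters.
        if (sign == '+') == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements


# Toy instance on 4 nodes: (0,1) and (2,3) are the only similar pairs.
labels = {frozenset(p): '-' for p in [(0, 2), (0, 3), (1, 2), (1, 3)]}
labels[frozenset((0, 1))] = '+'
labels[frozenset((2, 3))] = '+'
```

On a complete graph the two counts always sum to the number of pairs, which is why the exact maximization and minimization problems have the same optimal clusterings even though their approximability differs.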

In this paper, we study the above problems when the maximum number of clusters that we are
allowed to use is stipulated to be a fixed constant k. We denote the variants of the above problems
that have this constraint as MaxAgree[k] and MinDisAgree[k]. We note that, unlike most
clustering formulations, the MaxAgree and MinDisAgree problems are not trivialized if we
do not specify the number of clusters k as a parameter: whether the best clustering uses few
or many clusters is automatically dictated by the edge labels. However, the variants we study are
also interesting formulations, which are well-motivated in settings where the number of clusters
might be an external constraint that has to be met, even if there are "better" clusterings (i.e., ones
with more agreements) with a different number of clusters. Moreover, the existing algorithms for,
say, MinDisAgree cannot be modified in any easy way to output a quality solution with at most
k clusters. Therefore the k-clustering variants pose new, non-trivial challenges that require different
techniques for their solutions.

In the above description, we have assumed that every pair of items is labeled as + or − in the
input. In a more general variant, intended to capture situations where the classifier providing the
input might be unable to label certain pairs of elements as similar or dissimilar, the input is an
arbitrary graph G together with ± labels on its edges. We can again study the above problems
MaxAgree[k] (resp. MinDisAgree[k]) with the objective being to maximize (resp. minimize)
the number of agreements (resp. disagreements) on edges of G (that is, we do not count non-edges
of G as either agreements or disagreements). In situations where we study this more general
variant, we will refer to these problems as MaxAgree[k] on general graphs and MinDisAgree[k]
on general graphs. When we don't qualify with the phrase "on general graphs", we will always
mean the problems on complete graphs.

Our main result in this paper is a polynomial time approximation scheme (PTAS) for MaxAgree[k]
as well as MinDisAgree[k] for k ≥ 2. We now discuss prior work on these problems, followed
by a more detailed description of results in this paper.

1.1 Previous and related work 

The above problem seems to have been first considered by Ben-Dor et al. [3], motivated by some
computational biology questions. Later, Shamir et al. [14] studied the computational
complexity of the problem and showed that MaxAgree (and hence also MinDisAgree), as well as
MaxAgree[k] (and hence also MinDisAgree[k]) for each k ≥ 2, is NP-hard. They, however,
used the term "Cluster Editing" to refer to this problem.

Partially motivated by some machine learning problems concerning document classification,
Bansal, Blum, and Chawla also independently formulated and considered this problem. In
particular, they initiated the study of approximate solutions to MinDisAgree and MaxAgree, and
presented a PTAS for MaxAgree and a constant factor approximation algorithm for
MinDisAgree (the approximation guarantee was a rather large constant, though). They also noted a
simple factor 3 approximation algorithm for MinDisAgree[2]. Charikar, Guruswami, and Wirth
proved that MinDisAgree is APX-hard, and thus one cannot expect a PTAS for the minimization
problem similar to the PTAS for MaxAgree. They also gave a factor 4 approximation algorithm
for MinDisAgree by rounding a natural linear programming relaxation using the region growing
technique.

The problems on general graphs have also received attention. It is known that both
MaxAgree and MinDisAgree are APX-hard 00. Using a connection to minimum multicut, several
groups (including [9, 10]) presented an O(log n) approximation algorithm for MinDisAgree. In fact, it
was noted in [10] that the problem is as hard to approximate as minimum multicut (and so this log n
factor seems very hard to improve). For the maximization version, algorithms with performance
ratio better than 0.766 are known for MaxAgree [5, 15]. The latter work by Swamy [15] shows
that a factor 0.7666 approximation can also be achieved when the number of clusters is specified
(i.e., for MaxAgree[k] for k ≥ 2).

Another problem that has been considered, let us call it MaxCorr, is that of maximizing
correlation, defined to be the difference between the number of agreements and disagreements. A
factor O(log n) approximation for MaxCorr on complete graphs is presented in [6]. Recently,
an integrality gap of Ω(log n) was shown for the underlying semidefinite program used in [6]. The same
work also proves that an approximation of O(log θ(G)) can be achieved on general graphs G, where θ(·)
is the Lovász Theta Function.

1.2 Our results 

The only previous approximation for MinDisAgree[k] was a factor 3 approximation algorithm
for the case k = 2, noted by Bansal, Blum, and Chawla. The problems were shown to be NP-hard for every k ≥ 2 in [14] using
a rather complicated reduction. In this paper, we will provide a much simpler NP-hardness proof
and prove that both MaxAgree[k] and MinDisAgree[k] admit a polynomial time approximation
scheme for every k ≥ 2.¹ These approximation schemes are presented in Sections 3 and 4
respectively. The existence of a PTAS for MinDisAgree[k] is perhaps surprising in light of the
APX-hardness of MinDisAgree when the number of clusters is not specified to be a constant
(recall that the maximization version does admit a PTAS even when k is not specified).

It is often the case that minimization versions of problems are harder to solve compared to their
complementary maximization versions. The APX-hardness of MinDisAgree despite the existence
of a PTAS for MaxAgree is a notable example. The difficulty in these cases arises when the optimum
value of the minimization version is very small, since then even a PTAS for the complementary
maximization problem need not provide a good approximation for the minimization problem. In
this work, we first give a PTAS for MaxAgree[k]. This algorithm uses random sampling and
follows closely along the lines of the property testing algorithm for Max k-Cut due to [11]. We
then develop a PTAS for MinDisAgree[k], which is our main result. This requires more work,
and the algorithm returns the better of two solutions, one of which is obtained using the PTAS for
MaxAgree[k].

The difficulty in getting a PTAS for the minimization version is similar to that faced in the
problem of Min k-sum clustering, which has the complementary objective function to Metric
Max k-Cut. We remark that while an elegant PTAS for Metric Max k-Cut due to de la Vega
and Kenyon [8] has been known for several years, only recently has a PTAS for Min k-sum
clustering been obtained [7]. We note that the case of Min 2-sum clustering was solved in [12]
soon after the Metric Max Cut algorithm of [8], but the case k > 2 appeared harder. Similarly,
for MinDisAgree[k], we are able to quite easily give a PTAS for the 2-clustering version
using the algorithm for MaxAgree[2], but we have to work harder for the case of k > 2 clusters.
Some of the difficulty that surfaces when k > 2 is detailed in Section 4.1.

In Section 5, we also note some results on the complexity of MaxAgree[k] and MinDisAgree[k]
on general graphs; these are easy consequences of connections to problems like Max CUT and
graph colorability.

Our work seems to nicely complete the understanding of the complexity of problems related to
correlation clustering. Our algorithms not only achieve excellent approximation guarantees but are
also sampling-based and are thus simple, combinatorial, and quite easy to implement. To compare
with the situation for the case when k is not specified, the algorithm of Bansal, Blum, and Chawla for MinDisAgree
achieves a very large approximation factor. On the other hand, the algorithm in [5] achieves a good
factor of 4 but needs to solve a linear programming relaxation. In fact, it could well be that on some
instances we can find a better solution than the one the algorithm of [5] produces, even when
k isn't specified, by trying our algorithm for some small values of k.

2 NP-hardness of MinDisAgree and MaxAgree 

In this section we show that the exact versions of the problems we are trying to solve are NP-hard. An
NP-hardness result for MaxAgree on complete graphs was shown by Bansal, Blum, and Chawla; however, their reduction
crucially relies on the number of clusters growing with the input size, and thus does not yield any
hardness when the number of clusters is a fixed constant k. It was shown by Shamir, Sharan, and
Tsur [14], using a rather complicated reduction, that these problems are NP-hard for each fixed
number k ≥ 2 of clusters. We will provide a short and intuitive proof that MinDisAgree[k] and
MaxAgree[k] are NP-hard.

¹ Our approximation schemes will be randomized and deliver a solution with the claimed approximation guarantee
with high probability. For simplicity, we do not explicitly mention this from now on.

Clearly it suffices to establish the NP-hardness of MinDisAgree[k], since MaxAgree[k] can
be easily reduced on a complementary graph. We will first establish NP-hardness for k = 2; the
case for general k will follow by a simple "padding" with k − 2 large collections of nodes, with +
edges between nodes in each collection and − edges to everywhere else.

Theorem 1 MinDisAgree[2] on complete graphs is NP-hard. 

Proof: We know that Graph Min Bisection, namely partitioning the vertex set of a graph into two
equal halves so that the number of edges connecting vertices in different halves is minimized, is
NP-hard. From an instance G of Min Bisection with n (even) vertices we obtain a complete graph
G' using the following polynomial time construction.

Start with G and label all existing edges of G as + edges in G' and non-existing edges as −
edges. For each vertex v create an additional set of n vertices. Let's call these vertices, together
with v, a "group" $V_v$. Connect with + edges all pairs of vertices within $V_v$. All other edges with
one endpoint in $V_v$ are labeled as − edges (except those already labeled).
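The construction can be sketched in Python as follows. This is only an illustration of the labeling rule described above; the node numbering scheme, the function name `build_reduction_graph`, and the dict-based label encoding are assumptions made for the example.

```python
from itertools import combinations

def build_reduction_graph(n, edges):
    """Build the labeled complete graph G' from a Min Bisection instance G.

    n:     number of vertices of G (assumed even), named 0..n-1
    edges: iterable of pairs (u, v) that are edges of G
    Returns (num_nodes, labels) with labels[frozenset({a, b})] in {'+', '-'}.
    """
    edge_set = {frozenset(e) for e in edges}

    def group(a):
        # Original vertex v belongs to group v; the n extra nodes of
        # group v are numbered n + v*n, ..., n + v*n + (n - 1).
        return a if a < n else (a - n) // n

    num_nodes = n * (n + 1)          # n groups of n + 1 nodes each
    labels = {}
    for a, b in combinations(range(num_nodes), 2):
        pair = frozenset((a, b))
        if group(a) == group(b) or pair in edge_set:
            labels[pair] = '+'       # inside a group, or an edge of G
        else:
            labels[pair] = '-'       # every other pair
    return num_nodes, labels
```

For a single edge on two vertices, `build_reduction_graph(2, [(0, 1)])` yields 6 nodes in two groups of three, with 7 positive pairs (3 inside each group plus the original edge) out of 15.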

We will now show that any 2-clustering of G' with the minimum number of disagreements has
2 clusters of equal size, with all vertices of any group in the same cluster. Consider some optimal
2-clustering W with 2 clusters $W_1$ and $W_2$ such that $|W_1| \ne |W_2|$ or not all vertices of some group
are in the same cluster. Pick some group $V_v$ such that not all its vertices are assigned to the same
cluster. If such a group cannot be found, pick a group $V_v$ from the larger cluster. Place all the
vertices of the group in the same cluster, obtaining W' such that $\bigl| |W'_1| - |W'_2| \bigr|$ is minimized.

Let's assume that the vertices $V_v^1$ of group $V_v$ were in $W_1$ and the vertices $V_v^2$ were in $W_2$. Wlog, let's assume that
W' is obtained by moving the group vertices $V_v^1$ into cluster $W_2$:

$W'_1 = W_1 \setminus V_v^1, \qquad W'_2 = W_2 \cup V_v^1$

We now observe the following facts about the difference in the number of disagreements
between W' and W.

• Clearly the number of disagreements between vertices not in $V_v$, and between one vertex in
$V_v^2$ and one in $W'_1$, remains the same.

• The number of disagreements is decreased by $|V_v^1| \cdot |V_v^2|$, based on the fact that all edges
within $V_v$ are + edges.

• It is also decreased by at least $|V_v^1| \cdot |W'_1| - (n - 1)$, based on the fact that all but at most
n − 1 edges connecting vertices of $V_v$ to the rest of the graph are − edges.

• The number of disagreements increases by at most $|V_v^1| \cdot |W_2 \setminus V_v^2|$, because (possibly) all of the
vertices in $V_v^1$ are connected with − edges to vertices in $W_2$ outside their group.

Overall, the difference in the number of disagreements is at most
$|V_v^1| \cdot |W_2 \setminus V_v^2| - |V_v^1| \cdot |V_v^2| - |V_v^1| \cdot |W'_1| + (n - 1)$. Notice that since $\bigl| |W'_1| - |W'_2| \bigr|$ was minimized, it must be the case that
$|W'_1| \ge |W_2 \setminus V_v^2|$. Moreover, since a group has an odd number of vertices and the total number of
vertices of G' is even, it follows that $|W'_1| \ne |W_2 \setminus V_v^2|$ and $|W'_1| - |W_2 \setminus V_v^2| \ge 1$. Therefore the total
number of disagreements increases by at most $(n - 1) - |V_v^1| \cdot (|V_v^2| + 1)$. Since $|V_v^1| + |V_v^2| = n + 1$
and $V_v^1$ cannot be empty, it follows that $|V_v^1| \cdot (|V_v^2| + 1) > n$ and the number of disagreements
strictly decreases, contradicting the optimality of W.

Therefore the optimal solution to the MinDisAgree[2] instance has 2 clusters of equal size
and all vertices of any group are contained in a single cluster. It is now trivial to see that an optimal
solution to the Min Bisection problem can be easily derived from the MinDisAgree[2] solution,
which completes the reduction. ■

We are now able to easily derive the following NP-hardness result. 

Theorem 2 For every k ≥ 2, the problems MaxAgree[k] and MinDisAgree[k] on complete
graphs are NP-hard.

Proof: Consider an instance of the MinDisAgree[2] problem on a graph G with n vertices.
Create a graph G' by adding to G, k − 2 "groups" of n + 1 vertices each. All edges within a group
are marked as + edges, while the remaining edges are marked as − edges.

Consider now a k-clustering of G' such that the number of disagreements is minimized. It is
easy to see that all the vertices of a group must make up one cluster. Also observe that none of the
original vertices can end up in a group's cluster, since that would induce n + 1 disagreements,
strictly more than it could possibly induce in any of the 2 remaining clusters. Therefore the 2
non-group clusters are an optimal 2-clustering of G. The theorem easily follows. ■



3 PTAS for maximizing agreement with k clusters 

In this section we will present a PTAS for MaxAgree[k] for every fixed constant k. Our algorithm
follows closely the PTAS for Max k-CUT by Goldreich et al. [11]. In the next section, we will
present our main result, namely a PTAS for MinDisAgree[k], using the PTAS for MaxAgree[k]
together with additional ideas.²

Theorem 3 For every k ≥ 2, there is a polynomial time approximation scheme for MaxAgree[k].

Proof: We first note that for every k ≥ 2, and every instance of MaxAgree[k], the optimum
number OPT of agreements is at least $n^2/16$. Let $n_+$ be the number of positive edges, and
$n_- = \binom{n}{2} - n_+$ be the number of negative edges. By placing all vertices in a single cluster, we get
$n_+$ agreements. By placing vertices randomly in one of k clusters, we get an expected
$(1 - 1/k)\,n_-$ agreements just on the negative edges. Therefore
$\mathrm{OPT} \ge \max\{n_+, (1 - 1/k)\,n_-\} \ge (1 - 1/k)\binom{n}{2}/2 \ge n^2/16$.
The proof now follows from Theorem 4, which guarantees a solution
within additive $\epsilon n^2$ of OPT for arbitrary $\epsilon > 0$. ■
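The step from an additive to a multiplicative guarantee can be spelled out as follows. This is a worked restatement, assuming the algorithm of Theorem 4 is run with accuracy parameter $\epsilon' = \epsilon/8$ (the symbol $\epsilon'$ and this particular choice are ours, introduced only for illustration); it combines the guarantee $\mathrm{OPT} - \epsilon' n^2/2$ of Theorem 4 with the bound $n^2 \le 16\,\mathrm{OPT}$ from the proof above, writing W for the clustering returned.

```latex
\mathrm{agree}(W) \;\ge\; \mathrm{OPT} - \frac{\epsilon'}{2}\, n^2
\;\ge\; \mathrm{OPT} - \frac{\epsilon'}{2}\cdot 16\,\mathrm{OPT}
\;=\; (1 - 8\epsilon')\,\mathrm{OPT}
\;=\; (1 - \epsilon)\,\mathrm{OPT}
\qquad \text{for } \epsilon' = \epsilon/8 .
```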

² This is also similar in spirit, for example, to the PTAS for Min 2-sum clustering based on the PTAS for Metric
Max CUT [12, 8].
Theorem 4 On input $\epsilon$, $\delta$ and a labeling $\mathcal{L}$ of the edges of a complete graph G with n vertices,
with probability at least $1 - \delta$, algorithm MaxAg outputs a k-clustering of the graph such that
the number of agreements induced by this k-clustering is at least $\mathrm{OPT} - \epsilon n^2/2$, where OPT is the
optimal number of agreements induced by any k-clustering of G. The running time of the algorithm
is $\frac{n}{\epsilon} \cdot k^{O(\epsilon^{-2} \log k \log(1/(\epsilon\delta)))}$.

The proof of this theorem is presented in Section 3.2, and we now proceed to describe the
algorithm.

Algorithm MaxAg(k, ε):

Input: A labeling $\mathcal{L} : \binom{V}{2} \to \{+, -\}$ of the edges of the complete graph on vertex set V.
Output: A k-clustering of the graph, i.e., a partition of V into (at most) k parts $V_1, V_2, \ldots, V_k$.

1. Construct an arbitrary partition of the graph $(V^1, V^2, \ldots, V^m)$, $m = \lceil 4/\epsilon \rceil$.

2. For $i = 1 \ldots m$, choose uniformly at random with replacement from $V \setminus V^i$
   a subset $S^i$ of size $r = \Theta\bigl(\epsilon^{-2} \log\tfrac{1}{\epsilon\delta} \log k\bigr)$.

3. For $i = 1 \ldots m$ do the following:

   (a) For each clustering of $S^i$ into $(S^i_1, \ldots, S^i_k)$ do the following:

       (i) For each vertex $v \in V^i$ do
           (1) For $j = 1 \ldots k$, let $\beta_j(v) = |\Gamma^+(v) \cap S^i_j| + \sum_{l \ne j} |\Gamma^-(v) \cap S^i_l|$.
           (2) Place v in the cluster $W^i_j$ for which $\beta_j(v)$ is maximized.

       (ii) If the clustering on the subgraph induced by the edges between $V^i$ and $S^i$
            has more agreements than the currently stored one, store this clustering as $W^i = (W^i_1, \ldots, W^i_k)$.

4. For $j = 1 \ldots k$, let $W_j = \bigcup_i W^i_j$. Output clustering $(W_1, \ldots, W_k)$.
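The greedy placement in step 3(a)(i) can be sketched in Python as below: for one fixed clustering of the sample, each vertex is assigned to the cluster that maximizes its agreements with the sample. This covers only that inner step, not the enumeration over sample clusterings or the full algorithm. The function name `place_by_sample`, the callable `sign(u, v)`, and the toy instance are assumptions made for illustration.

```python
def place_by_sample(vertices, sample_clusters, sign):
    """Assign each vertex to the cluster maximizing agreements with the sample.

    vertices:        vertices of the current part (disjoint from the sample)
    sample_clusters: list of k lists, one guessed clustering of the sample
    sign:            function (u, v) -> '+' or '-' giving the edge label
    """
    k = len(sample_clusters)
    placement = {}
    for v in vertices:
        pos = [sum(1 for u in S if sign(v, u) == '+') for S in sample_clusters]
        neg = [sum(1 for u in S if sign(v, u) == '-') for S in sample_clusters]
        total_neg = sum(neg)
        # Score of cluster j: '+' edges into sample cluster j plus
        # '-' edges into all the other sample clusters.
        score = [pos[j] + (total_neg - neg[j]) for j in range(k)]
        placement[v] = max(range(k), key=lambda j: score[j])
    return placement


# Toy instance: two planted clusters {0,1,2} and {3,4,5}.
planted = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
sign = lambda u, v: '+' if planted[u] == planted[v] else '-'
placement = place_by_sample([2, 5], [[0, 1], [3, 4]], sign)
```

Because the sample is drawn with replacement, the lists in `sample_clusters` may contain repeated nodes; counting over lists rather than sets keeps those multiplicities.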

3.1 Overview 

Our algorithm is given a complete graph G(V, E) on n vertices. All the edges are marked as + or
−, denoting whether adjacent vertices are in agreement or disagreement respectively. For a vertex
v, let $\Gamma^+(v)$ be the set of vertices adjacent to v via + edges, and $\Gamma^-(v)$ the set of vertices adjacent
to v via − edges.

The algorithm works in $m = O(1/\epsilon)$ steps. At each step we are placing $\Theta(\epsilon n)$ vertices into
clusters. We will show that with constant probability our choices of the $S^i$'s will allow us to place
the vertices in such a way that the decrease in the number of agreements with respect to an
optimal clustering is $O(\epsilon^2 n^2)$ per step; thus the algorithm outputs a solution that has $O(\epsilon n^2)$ fewer
agreements than any optimal solution.

3.2 Performance analysis of MaxAg(k, ε) algorithm

Consider an arbitrary optimal k-clustering of the graph $D = (D_1, \ldots, D_k)$. We consider the
subsets of each cluster over our partition of vertices, defined as

for $j = 1, \ldots, k$: $D^i_j = D_j \cap V^i$,

$D^i = (D^i_1, \ldots, D^i_k)$.

We will now define a sequence of hybrid clusterings, such that hybrid clustering $H^i$, for
$i = 1, 2, \ldots, m + 1$, consists of the vertices as clustered by our algorithm up to (not including) the i-th
step and the rest of the vertices as clustered by D:

$H^i = (H^i_1, \ldots, H^i_k)$,

$\mathcal{H}^i = (\mathcal{H}^i_1, \ldots, \mathcal{H}^i_k)$,

for $j = 1, \ldots, k$: $H^i_j = \bigl(\bigcup_{l=1}^{i-1} W^l_j\bigr) \cup \bigl(\bigcup_{l=i}^{m} D^l_j\bigr)$,

for $j = 1, \ldots, k$: $\mathcal{H}^i_j = H^i_j \setminus V^i$.

Although we will go through all possible clusterings of $S^i$, for the rest of the analysis consider the
particular clustering that matches the hybrid clusterings,

for $j = 1, \ldots, k$: $S^i_j = S^i \cap \mathcal{H}^i_j$.

The following lemma captures the fact that our random sample with high probability gives
us a good estimate on the number of agreements towards each cluster for most of the vertices
considered.

Lemma 5 For $i = 1 \ldots m$, with probability at least $1 - \delta/(4m)$ on the choice of $S^i$, for all but at
most an $\epsilon/8$ fraction of the vertices $v \in V^i$, the following holds:

for $j = 1, \ldots, k$: $\left| \frac{|\Gamma^+(v) \cap S^i_j|}{r} - \frac{|\Gamma^+(v) \cap \mathcal{H}^i_j|}{|V \setminus V^i|} \right| \le \frac{\epsilon}{32}$. (1)

(Note that if (1) above holds, then it also holds with $\Gamma^-(v)$ in place of $\Gamma^+(v)$.)

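Proof: Consider an arbitrary vertex $v \in V^i$ and the randomly chosen set $S^i = \{u_1, \ldots, u_r\}$. For
each $j \in \{1, \ldots, k\}$, we define the random variables

for $l = 1, \ldots, r$: $\alpha^l_j = 1$ if $u_l \in \Gamma^+(v) \cap S^i_j$, and $\alpha^l_j = 0$ otherwise.

Clearly $\sum_{l=1}^{r} \alpha^l_j = |\Gamma^+(v) \cap S^i_j|$ and $\Pr[\alpha^l_j = 1] = \frac{|\Gamma^+(v) \cap \mathcal{H}^i_j|}{|V \setminus V^i|}$.
Using an additive Chernoff bound we get that

$\Pr\left[ \left| \frac{|\Gamma^+(v) \cap S^i_j|}{r} - \frac{|\Gamma^+(v) \cap \mathcal{H}^i_j|}{|V \setminus V^i|} \right| > \frac{\epsilon}{32} \right] < 2 \exp\left(-2 \left(\frac{\epsilon}{32}\right)^2 r\right) \le \frac{\epsilon\delta}{32mk}$. (2)

Defining a random variable to count the number of vertices not satisfying inequality (1), and
using Markov's inequality, we get that for that particular j, inequality (1) holds for all but a fraction
$\epsilon/8$ of the vertices $v \in V^i$, with probability at least $1 - \delta/(4mk)$. Using a union bound the
lemma easily follows. ■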
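As a numerical illustration of the Chernoff step, the sketch below computes a sample size r that is sufficient for a bound of the form $2\exp(-2t^2 r) \le p$. The specific choices $t = \epsilon/32$, $p = \epsilon\delta/(32mk)$ and $m = \lceil 4/\epsilon \rceil$ are assumptions taken from our reading of the analysis, and the function name is hypothetical; the constants are not meant to be tight.

```python
import math

def chernoff_sample_size(eps, delta, k):
    """Return (m, r) such that 2 * exp(-2 * (eps/32)**2 * r) <= eps*delta/(32*m*k)."""
    m = math.ceil(4 / eps)                 # assumed number of parts in the partition
    t = eps / 32                           # allowed additive deviation
    target = eps * delta / (32 * m * k)    # allowed failure probability per (v, j)
    # Solve 2 * exp(-2 * t^2 * r) <= target for r and round up.
    r = math.ceil(math.log(2 / target) / (2 * t * t))
    return m, r
```

The resulting r grows like $\epsilon^{-2} \log(k/(\epsilon\delta))$ and is independent of n, which is what makes enumerating all $k^r$ clusterings of the sample affordable for constant k and $\epsilon$.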

We define agree(A) to be equal to the number of agreements induced by k-clustering A. Now
consider the placement of the $V^i$ vertices in clusters $W^i_1, \ldots, W^i_k$, as performed by the algorithm during
step i. We will examine the number of agreements compared to the placement of the same vertices
under $H^i$ (placement under the optimal clustering); more specifically, we will bound the difference
in the number of agreements induced by placing vertices differently than $H^i$. The following lemma
formalizes this concept.

Lemma 6 For $i = 0, \ldots, m$, we have $\mathrm{agree}(H^{i+1}) \ge \mathrm{agree}(D) - i \cdot \frac{1}{8}\epsilon^2 n^2$.

Proof: Observe that $H^1 = D$ and $H^{m+1} = W$. The only vertices placed differently between $H^{i+1}$
and $H^i$ are the vertices in $V^i$. Suppose that our algorithm places $v \in V^i$ in cluster x and v is placed
in cluster x' under $H^i$. For each vertex v the number of agreements towards clusters other than
x, x' remains the same; therefore we will focus on the number of agreements towards these two
clusters and the number of agreements within $V^i$.

The number of agreements we could lose by thus misplacing v is

$\mathrm{diff}_{x,x'}(v) = |\Gamma^+(v) \cap \mathcal{H}^i_{x'}| - |\Gamma^+(v) \cap \mathcal{H}^i_x| + |\Gamma^-(v) \cap \mathcal{H}^i_x| - |\Gamma^-(v) \cap \mathcal{H}^i_{x'}|$.

Since our algorithm chose cluster x, by construction

$|\Gamma^+(v) \cap S^i_x| + |\Gamma^-(v) \cap S^i_{x'}| \ge |\Gamma^+(v) \cap S^i_{x'}| + |\Gamma^-(v) \cap S^i_x|$. (3)

If inequality (1) holds for vertex v, using it for $\Gamma^+(v)$ and $\Gamma^-(v)$ in both clusters x, x', we
obtain bounds on the difference of agreements between our random sample's clusters $S^i_x$, $S^i_{x'}$
and the hybrid clusters $\mathcal{H}^i_x$, $\mathcal{H}^i_{x'}$. Combining with inequality (3) we get that $\mathrm{diff}_{x,x'}(v)$ is at most
$\frac{1}{8}\epsilon n$. Therefore the total decrease in the number of agreements by this type of vertices is at most
$\frac{1}{8}\epsilon n |V^i| \le \frac{\epsilon n^2}{8m}$.

By Lemma 5 there are at most $(\epsilon/8)|V^i|$ vertices in $V^i$ for which inequality (1) doesn't hold.
The total number of agreements originating from these vertices is at most $\frac{1}{8}\epsilon |V^i| n \le \frac{\epsilon n^2}{8m}$. Finally,
the total number of agreements from within $V^i$ is at most $|V^i|^2 \le \frac{\epsilon n^2}{4m}$.

Overall the number of agreements that we could lose in one step of the algorithm is at most
$\frac{\epsilon n^2}{2m} \le \frac{1}{8}\epsilon^2 n^2$. The lemma follows by induction. ■

The approximation guarantee of Theorem 4 easily follows from Lemma 6. At each step of the
algorithm we need to go over all $k^r$ k-clusterings of our random sample $S^i$ and there are $O(n/\epsilon)$
steps. All other operations within a step can be easily implemented to run in constant time and the
running time bound of our algorithm follows as well. ■

4 PTAS for minimizing disagreements with k clusters 

This section is devoted to the proof of the following theorem, which is our main result in this paper. 



Theorem 7 (Main) For every k ≥ 2, there is a PTAS for MinDisAgree[k].

The algorithm for MinDisAgree[k] will use the approximation scheme for MaxAgree[k] as a
subroutine. The latter already provides a very good approximation for the number of disagreements 
unless this number is very small. So in the analysis, the main work is for the case when the optimum 
clustering is right on most of the edges. 
4.1 Idea behind the algorithm

The case of 2 clusters turns out to be a lot simpler and we use it to first illustrate the basic idea. By
the PTAS for maximization, we only need to focus on the case when the optimum clustering has
only $\mathrm{OPT} = \gamma n^2$ disagreements for some small $\gamma > 0$. We draw a random sample S and try all
partitions of it, and focus on the run when we guess the right partition $S = S_1 \cup S_2$, namely the
way some fixed optimal clustering $\mathcal{D}$ partitions S. Since the optimum has a very large number of
agreements, each node in a set A of size at least $(1 - O(\gamma))n$ will have a clear choice of which side
it prefers to be on, and we can find this out with high probability based on edges into S. Therefore,
we can find a clustering which agrees with $\mathcal{D}$ on a set A of at least a $1 - O(\gamma)$ fraction of the nodes.
We can then go through this clustering and, for each node in parallel, switch it to the other side
if that improves the solution, to produce the final clustering. Nodes in A won't get switched and
will remain clustered exactly as in the optimum $\mathcal{D}$. The number of extra disagreements compared
to $\mathcal{D}$ on edges amongst nodes in $V \setminus A$ is obviously at most the number of those edges, which
is $O(\gamma^2 n^2)$. For edges connecting a node $u \in V \setminus A$ to nodes in A, since we placed u on the
"better" side, and A is placed exactly as in $\mathcal{D}$ in the final clustering, we can have at most $O(\gamma n)$
extra disagreements per node compared to $\mathcal{D}$, simply the error introduced by the edges towards
the misplaced nodes in $V \setminus A$. Therefore we get a clustering with at most $\mathrm{OPT} + O(\gamma^2 n^2) \le
(1 + O(\gamma))\,\mathrm{OPT}$ disagreements.
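The "switch each node in parallel" step for two clusters can be sketched as follows. This is a minimal illustration of one such pass on a toy instance, not the paper's full procedure; the function name `switch_pass`, the `side` dict, and the `sign` callable are assumptions for the example.

```python
def switch_pass(nodes, side, sign):
    """One parallel pass: each node moves to the side with more agreements.

    nodes: list of all nodes
    side:  dict node -> 0 or 1, the current 2-clustering
    sign:  function (u, w) -> '+' or '-' giving the edge label
    All decisions are made against the OLD clustering `side`.
    """
    def agreements(u, s):
        # Agreements on edges at u if u sits on side s, others as in `side`.
        return sum(1 for w in nodes
                   if w != u and (sign(u, w) == '+') == (side[w] == s))

    new_side = {}
    for u in nodes:
        stay = agreements(u, side[u])
        move = agreements(u, 1 - side[u])
        new_side[u] = side[u] if stay >= move else 1 - side[u]
    return new_side


# Toy instance: planted clusters {0,1,2} and {3,4,5}, node 2 starts misplaced.
planted = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
sign = lambda u, w: '+' if planted[u] == planted[w] else '-'
start = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
result = switch_pass(list(range(6)), start, sign)
```

In this example only the misplaced node 2 switches sides, while the correctly placed nodes stay put, mirroring the claim that nodes in A are not switched.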

Our algorithm for k-clustering for k > 2 uses a similar high-level approach, but is more
complicated. The main thing which breaks down compared to the k = 2 case is the following. For two
clusters, if $\mathcal{D}$ has agreements on a large, i.e. $(1 - O(\gamma))$, fraction of edges incident on a node u
(i.e. if $u \in A$ in the above notation), then we are guaranteed to place u exactly as in $\mathcal{D}$ based on
the sample S (when we guess its correct clustering), since the other option will have much poorer
agreement. This is not the case when k > 2, and one can get a large number of agreements by
placing a node in, say, one of two possible clusters. It therefore does not seem possible to argue that
each node in A is correctly placed, and then to use this to finish off the clustering.

However, what we can show is that nodes in A that are incorrectly placed, call this set B, must
be in small clusters of $\mathcal{D}$, and thus are few (at most $O(n/k)$) in number. Moreover, every node
in A that falls in one of the large clusters that we produce is guaranteed to be correctly placed.
(These facts are the content of Lemma 10.) The nodes in B still need to be clustered, and since
B could be of size $\Omega(n/k)$, even a small number of mistakes per node in clustering them is more
than we can afford. We get around this predicament by noting that nodes in B and $A \setminus B$ are in
different sets of clusters in $\mathcal{D}$. It follows that we can cluster B recursively in new clusters (and
we are making progress because B is clustered using fewer than k clusters). The actual algorithm
must also deal with nodes outside A, and in particular decide which of these nodes are recursively
clustered along with B. With this intuition in place, we now proceed to the formal specification of
the algorithm.

4.2 Algorithm for k-clustering to minimize disagreements

The following is the algorithm that gives a factor $(1 + \epsilon)$ approximation for MinDisAgree[k].
We will use a small enough absolute constant $c_1$ in the algorithm; the choice $c_1 = 1/20$ will work.

Algorithm MinDisAg(k, ε):

Input: A labeling $\mathcal{L} : \binom{V}{2} \to \{+, -\}$ of the edges of the complete graph on vertex set
$V = \{1, 2, \ldots, n\}$

Output: A k-clustering of the graph, i.e., a partition of V into (at most) k parts $V_1, V_2, \ldots, V_k$.

0. If k = 1, return the obvious 1-clustering.

e 2 c 2 

1. Run the PTAS for Max Agree [A;] from previous section on input C with accuracy -^A. 

Let ClusMax be the k-clustering returned.

2. Set (3 = ^p. Pick a sample S C V by drawing 51 °f - vertices u.a.r with replacement. 

3. ClusVal ← 0; /* Keeps track of value of best clustering found so far */

4. For each partition $\tilde{S}$ of S as $S_1 \cup S_2 \cup \cdots \cup S_k$, perform the following steps:

(a) Initialize the clusters $C_i = S_i$ for $1 \le i \le k$.

(b) For each $u \in V \setminus S$

(i) For each $i = 1, 2, \ldots, k$, compute $\mathrm{pval}^{\tilde{S}}(u, i)$, defined to be $1/|S|$ times the number of
agreements on edges connecting u to nodes in S if u is placed in cluster i along with $S_i$.

(ii) Let $j_u = \arg\max_i \mathrm{pval}^{\tilde{S}}(u, i)$, and define $\mathrm{val}^{\tilde{S}}(u) = \mathrm{pval}^{\tilde{S}}(u, j_u)$.

(iii) Place u in cluster $C_{j_u}$, i.e., $C_{j_u} \leftarrow C_{j_u} \cup \{u\}$.

(c) Compute the set of large and small clusters as

Large = {j \ 1 < j < k, \Cj\ > and Small = {1,2, . . . ,k} - Large. 
Let $l = |\mathrm{Large}|$ and $s = k - l = |\mathrm{Small}|$. /* Note that s < k. */

(d) Cluster $W = \bigcup_{j \in \mathrm{Small}} C_j$ into s clusters using a recursive call to algorithm MinDisAg(s, ε/10).
Let the clustering output by the recursive call be $W = W'_1 \cup W'_2 \cup \cdots \cup W'_s$
(where some of the $W'_i$'s may be empty).

(e) Let C be the clustering comprising the k clusters $\{C_j\}_{j \in \mathrm{Large}}$ and $\{W'_i\}_{1 \le i \le s}$.
If the number of agreements of C is at least ClusVal, update ClusVal to this value, and
update ClusMin ← C.

5. Output the better of the two clusterings ClusMax and ClusMin.
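Steps 4(a) to 4(c) for one guessed partition of the sample can be sketched in Python as below. The sketch omits the enumeration over partitions, the recursive call, and the comparison with ClusMax. The size threshold separating large from small clusters is passed in as a parameter `threshold`, since its exact value is not reproduced here; the function names and the toy instance are likewise assumptions for illustration.

```python
def pval_sample(u, i, sample_parts, sign):
    """Fraction of sample edges at u that agree if u joins part i."""
    size = sum(len(part) for part in sample_parts)
    agree = 0
    for j, part in enumerate(sample_parts):
        for w in part:
            # '+' edges agree inside part i, '-' edges agree towards other parts.
            if (sign(u, w) == '+') == (j == i):
                agree += 1
    return agree / size


def place_and_split(nodes, sample_parts, sign, threshold):
    """Place non-sample nodes greedily, then split clusters by size."""
    k = len(sample_parts)
    clusters = [list(part) for part in sample_parts]          # step 4(a)
    sampled = {w for part in sample_parts for w in part}
    for u in nodes:                                           # step 4(b)
        if u in sampled:
            continue
        j_u = max(range(k), key=lambda i: pval_sample(u, i, sample_parts, sign))
        clusters[j_u].append(u)
    # Step 4(c): `threshold` is a hypothetical stand-in for the size cutoff.
    large = [j for j in range(k) if len(clusters[j]) >= threshold]
    small = [j for j in range(k) if len(clusters[j]) < threshold]
    return clusters, large, small


# Toy instance: planted clusters {0,...,5} and {6,7}.
planted = lambda a: 0 if a <= 5 else 1
sign = lambda a, b: '+' if planted(a) == planted(b) else '-'
clusters, large, small = place_and_split(range(8), [[0, 1], [6]], sign, 3)
```

In the algorithm proper, the nodes of the clusters indexed by `small` would then be re-clustered by the recursive call with fewer clusters.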

4.3 Performance analysis of the algorithm 

We now analyze the approximation guarantee of the above algorithm. We need some notation. Let
$\mathcal{A} = A_1 \cup A_2 \cup \cdots \cup A_k$ be any $k$-clustering of the nodes in $V$. Define the function $\mathrm{val}^{\mathcal{A}} : V \to [0, 1]$
as follows: $\mathrm{val}^{\mathcal{A}}(u)$ equals the fraction of edges incident upon node $u$ whose labels agree with
clustering $\mathcal{A}$ (i.e., we count negative edges that are cut by $\mathcal{A}$ and positive edges that lie within the
same $A_i$ for some $i$). Also define $\mathrm{disagr}(\mathcal{A})$ to be the number of disagreements of $\mathcal{A}$ w.r.t. the labeling
$\mathcal{L}$. (Clearly $\mathrm{disagr}(\mathcal{A}) = \frac{n-1}{2} \sum_{u \in V} (1 - \mathrm{val}^{\mathcal{A}}(u))$.) For a node $u \in V$ and $1 \le i \le k$, let $\mathcal{A}^{(u,i)}$
denote the clustering obtained from $\mathcal{A}$ by moving $u$ to $A_i$ and leaving all other nodes untouched.
We define the function $\mathrm{pval}^{\mathcal{A}} : V \times \{1, 2, \ldots, k\} \to [0, 1]$ as follows: $\mathrm{pval}^{\mathcal{A}}(u, i)$ equals the fraction
of edges incident upon $u$ that agree with the clustering $\mathcal{A}^{(u,i)}$.

In the following, we fix $\mathcal{D}$ to be any optimal $k$-clustering that partitions $V$ as $V = D_1 \cup D_2 \cup \cdots \cup D_k$.
Let $\gamma$ be defined to be $\mathrm{disagr}(\mathcal{D})/n^2$, i.e., the clustering $\mathcal{D}$ has $\gamma n^2$ disagreements
w.r.t. the input labeling $\mathcal{L}$.
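
To make the notation concrete, here is a minimal sketch that computes val, pval and disagr for a small labeled complete graph and checks the identity relating disagr to the sum of 1 - val. The representation (vertices 0..n-1, a list `cluster_of` giving each vertex's cluster index, labels keyed by sorted pairs) and the function names are assumptions made for this illustration only.

```python
from itertools import combinations


def agrees(label, same_cluster):
    """A '+' edge agrees when its endpoints share a cluster; a '-' edge when they do not."""
    return (label == '+') == same_cluster


def val(labels, cluster_of, u):
    """Fraction of the n-1 edges at u whose labels agree with the clustering."""
    n = len(cluster_of)
    good = sum(
        agrees(labels[(min(u, v), max(u, v))], cluster_of[u] == cluster_of[v])
        for v in range(n) if v != u
    )
    return good / (n - 1)


def pval(labels, cluster_of, u, i):
    """val of u after moving u to cluster i and leaving every other vertex in place."""
    moved = list(cluster_of)
    moved[u] = i
    return val(labels, moved, u)


def disagr(labels, cluster_of):
    """Number of edges whose labels disagree with the clustering."""
    n = len(cluster_of)
    return sum(
        not agrees(labels[(u, v)], cluster_of[u] == cluster_of[v])
        for u, v in combinations(range(n), 2)
    )


# Small example: 4 vertices, clusters {0, 1} and {2, 3}.
labels = {(0, 1): '+', (0, 2): '-', (0, 3): '-', (1, 2): '+', (1, 3): '-', (2, 3): '+'}
cluster_of = [0, 0, 1, 1]
n = len(cluster_of)
identity_rhs = (n - 1) / 2 * sum(1 - val(labels, cluster_of, u) for u in range(n))
print(disagr(labels, cluster_of), identity_rhs)
```

Each disagreeing edge lowers val at both of its endpoints, which is why the sum is halved in the identity.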
Call a sample $S$ of nodes, each drawn uniformly at random with replacement, $\alpha$-good if
the nodes in $S$ are distinct (see footnote 3) and for each $u \in V$ and $i \in \{1, 2, \ldots, k\}$,

$$|\mathrm{pval}^{\tilde{S}}(u, i) - \mathrm{pval}^{\mathcal{D}}(u, i)| \le \alpha \qquad (4)$$

for the partition $\tilde{S}$ of $S$ as $\bigcup_{i=1}^{k} S_i$ with $S_i = S \cap D_i$ (where $\mathrm{pval}^{\tilde{S}}(\cdot, \cdot)$ is as defined in the
algorithm). The following lemma follows by a standard Chernoff and union bound argument similar to
Lemma 5 (see footnote 4).

Lemma 8 The sample $S$ picked in Step 2 is $\beta$-good with high probability (at least $1 - O(1/\sqrt{n})$).

Therefore, in what follows we assume that the sample $S$ is $\beta$-good. In the rest of the discussion,
we focus on the run of the algorithm for the partition $\tilde{S}$ of $S$ that agrees with the optimal partition
$\mathcal{D}$, i.e., $S_i = S \cap D_i$. (All lemmas stated apply for this run of the algorithm, though we don't make
this explicit in the statement.) Let $(C_1, C_2, \ldots, C_k)$ be the clusters produced by the algorithm at
the end of Step 4(c) on this run. Let's begin with the following simple observation.

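Lemma 9 Suppose a node $u \in D_s$ is placed in cluster $C_r$ at the end of Step 4(b) for $r \ne s$,
$1 \le r, s \le k$. Then $\mathrm{pval}^{\mathcal{D}}(u, r) \ge \mathrm{pval}^{\mathcal{D}}(u, s) - 2\beta = \mathrm{val}^{\mathcal{D}}(u) - 2\beta$.

Proof: Note that since $u \in D_s$, $\mathrm{val}^{\mathcal{D}}(u) = \mathrm{pval}^{\mathcal{D}}(u, s)$. By the $\beta$-goodness of $S$ (recall Equation
(4)), $\mathrm{pval}^{\tilde{S}}(u, s) \ge \mathrm{pval}^{\mathcal{D}}(u, s) - \beta$. Since we chose to place $u$ in $C_r$ instead of $C_s$, we must have
$\mathrm{pval}^{\tilde{S}}(u, r) \ge \mathrm{pval}^{\tilde{S}}(u, s)$. By the $\beta$-goodness of $S$ again, we have $\mathrm{pval}^{\mathcal{D}}(u, r) \ge \mathrm{pval}^{\tilde{S}}(u, r) - \beta$.
Combining these three inequalities gives us the claim of the lemma. ■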
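
Written as a single chain, the three inequalities combine as follows. This only restates the argument of the proof, with $\tilde{S}$ denoting the partition of the sample that agrees with $\mathcal{D}$.

```latex
\begin{align*}
\mathrm{pval}^{\mathcal{D}}(u, r)
  &\ge \mathrm{pval}^{\tilde{S}}(u, r) - \beta
     && \text{($\beta$-goodness of $S$)} \\
  &\ge \mathrm{pval}^{\tilde{S}}(u, s) - \beta
     && \text{($u$ was placed in $C_r$, so $r$ maximizes $\mathrm{pval}^{\tilde{S}}(u, \cdot)$)} \\
  &\ge \mathrm{pval}^{\mathcal{D}}(u, s) - 2\beta
     && \text{($\beta$-goodness of $S$)} \\
  &= \mathrm{val}^{\mathcal{D}}(u) - 2\beta
     && \text{(since $u \in D_s$)}.
\end{align*}
```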

Define the set of nodes of low value in the optimal clustering $\mathcal{D}$ as $T_{\mathrm{low}} = \{u \mid \mathrm{val}^{\mathcal{D}}(u) \le
1 - c_1/k^2\}$. The total number of disagreements is at least the number of disagreements induced
by these low valued nodes, therefore

$$|T_{\mathrm{low}}| \le \frac{2k^2 \, \mathrm{disagr}(\mathcal{D})}{(n-1) c_1} = \frac{2k^2 \gamma n^2}{(n-1) c_1} \le \frac{4k^2 \gamma n}{c_1}. \qquad (5)$$

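
The first inequality in (5) is a double-counting argument. As a sketch using only the definitions above: every node of low value has at least $(n-1)c_1/k^2$ disagreeing incident edges, and the identity for $\mathrm{disagr}$ already accounts for each edge being seen from both endpoints.

```latex
\begin{align*}
\mathrm{disagr}(\mathcal{D})
  &= \frac{n-1}{2} \sum_{u \in V} \bigl(1 - \mathrm{val}^{\mathcal{D}}(u)\bigr)
   \ge \frac{n-1}{2} \sum_{u \in T_{\mathrm{low}}} \bigl(1 - \mathrm{val}^{\mathcal{D}}(u)\bigr)
   \ge \frac{n-1}{2} \cdot |T_{\mathrm{low}}| \cdot \frac{c_1}{k^2}, \\
\text{so}\quad |T_{\mathrm{low}}|
  &\le \frac{2k^2\,\mathrm{disagr}(\mathcal{D})}{(n-1)\,c_1}
   = \frac{2k^2 \gamma n^2}{(n-1)\,c_1}
   \le \frac{4k^2 \gamma n}{c_1}
   \qquad \text{(using } n/(n-1) \le 2 \text{ for } n \ge 2\text{)}.
\end{align*}
```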
The following key lemma asserts that the large clusters produced in Step 4(c) are basically correct. 



Lemma 10 Suppose $\gamma \le \frac{c_1}{16k^3}$. Let Large $\subseteq \{1, 2, \ldots, k\}$ be the set of large clusters as in Step
4(c) of the algorithm. Then for each $i \in$ Large, $C_i - T_{\mathrm{low}} = D_i - T_{\mathrm{low}}$, that is, w.r.t. nodes of large
value, $C_i$ precisely agrees with the optimal cluster $D_i$.

Proof: Let $i \in$ Large be arbitrary. We will first prove the inclusion $C_i - T_{\mathrm{low}} \subseteq D_i - T_{\mathrm{low}}$. Suppose
this is not the case and there exists $u \in C_i - (D_i \cup T_{\mathrm{low}})$. Let $u \in D_j$ for some $j \ne i$. Since $u \notin T_{\mathrm{low}}$,
we have $\mathrm{val}^{\mathcal{D}}(u) \ge 1 - c_1/k^2$, which implies $\mathrm{pval}^{\mathcal{D}}(u, j) \ge 1 - c_1/k^2$. By Lemma 9 this gives
$\mathrm{pval}^{\mathcal{D}}(u, i) \ge 1 - c_1/k^2 - 2\beta$. Therefore we have

$$2(1 - c_1/k^2 - \beta) \le \mathrm{pval}^{\mathcal{D}}(u, i) + \mathrm{pval}^{\mathcal{D}}(u, j) \le 2 - \frac{|D_i| + |D_j| - 1}{n}$$

where the last step follows from the simple but powerful observation that each edge connecting $u$
to a vertex in $D_i \cup D_j$ is correctly classified in exactly one of the two placements of $u$ in the $i$'th
and $j$'th clusters (when leaving every other vertex as in clustering $\mathcal{D}$). We conclude that both

$$|D_i|, |D_j| \le 2\left(\frac{c_1}{k^2} + \beta\right) n + 1. \qquad (6)$$

What we have shown is that if $u \in C_i - (D_i \cup T_{\mathrm{low}})$, then $u \in D_j$ for some $j$ with $|D_j| \le
2(c_1/k^2 + \beta)n + 1$. It follows that $|C_i - (D_i \cup T_{\mathrm{low}})| \le 2(c_1/k + \beta k)n + k$. Therefore,

$$|D_i| \ge |C_i| - |T_{\mathrm{low}}| - 2\left(\frac{c_1}{k} + \beta k\right) n - k \ge \frac{n}{2k} - \frac{4k^2 \gamma n}{c_1} - 2\left(\frac{c_1}{k} + \beta k\right) n - k > 2\left(\frac{c_1}{k^2} + \beta\right) n + 1$$

where the last step follows since $\gamma \le \frac{c_1}{16k^3}$, $k \ge 2$, $c_1 = 1/20$, and $\beta$ is tiny. This contradicts (6),
and so we conclude $C_i - T_{\mathrm{low}} \subseteq D_i - T_{\mathrm{low}}$.

Now for the other inclusion $D_i - T_{\mathrm{low}} \subseteq C_i - T_{\mathrm{low}}$. If a node $v \in D_i - (C_i \cup T_{\mathrm{low}})$ is placed in
$C_q$ for $q \ne i$, then a similar argument to how we concluded (6) establishes $|D_i| \le 2(c_1/k^2 + \beta)n + 1$,
which is impossible since we have shown $D_i \supseteq C_i - T_{\mathrm{low}}$, and hence $|D_i| \ge |C_i| - |T_{\mathrm{low}}| \ge
\frac{n}{2k} - \frac{4k^2 \gamma n}{c_1} > 2(c_1/k^2 + \beta)n + 1$, where the last step follows using $\gamma \le \frac{c_1}{16k^3}$ and $k \ge 2$ for the choice
$c_1 = 1/20$. ■

Footnote 3. Note that in the algorithm we draw elements of the sample with replacement, but for the analysis, we can pretend
that $S$ consists of distinct elements, since this happens with high probability.

Footnote 4. Since our sample size is $\Omega(\log n)$ as opposed to $O(1)$ that was used in Lemma 5, we can actually ensure (4) holds
for every vertex w.h.p.
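
The final strict inequality in the first half of the proof is easy to sanity-check numerically. The sketch below is only an illustration: it assumes $c_1 = 1/20$, takes $\gamma$ at its largest allowed value $c_1/(16k^3)$, and assumes $\beta = c_1\epsilon/(16k^2)$ as the concrete meaning of "tiny"; the function name is hypothetical. It returns the left-hand side minus the right-hand side, which should be positive once $n$ is large enough.

```python
def lemma10_gap(n, k, eps, c1=1 / 20):
    """Lower bound on |D_i| minus the upper bound from (6); positive means a contradiction."""
    beta = c1 * eps / (16 * k**2)   # assumed choice of beta
    gamma = c1 / (16 * k**3)        # largest gamma allowed by the lemma
    lower = n / (2 * k) - 4 * k**2 * gamma * n / c1 - 2 * (c1 / k + beta * k) * n - k
    upper = 2 * (c1 / k**2 + beta) * n + 1
    return lower - upper


for k in range(2, 6):
    print(k, lemma10_gap(10**6, k, 0.1) > 0)
```

For very small $n$ the additive terms $k$ and $1$ dominate and the gap is negative, so the inequality should be read as holding for sufficiently large $n$.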

The next lemma states that there is a clustering which is very close to optimum which agrees 
exactly with our large clusters. This will enable us to find a near-optimal clustering by recursing 
on the small clusters to recluster them as needed, exactly as our algorithm does. 

Lemma 11 Assume $\gamma \le \frac{c_1}{16k^3}$. There exists a clustering $\mathcal{F}$ that partitions $V$ as $V = F_1 \cup F_2 \cup \cdots \cup F_k$
that satisfies the following:

(i) $F_i = C_i$ for every $i \in$ Large.

(ii) The number of disagreements of the clustering $\mathcal{F}$ is at most $\mathrm{disagr}(\mathcal{F}) \le \gamma n^2 \left(1 + \frac{4k^2 \beta}{c_1} + \frac{8k^4 \gamma}{c_1^2}\right)$.

Proof: Suppose $w \in T_{\mathrm{low}}$ is such that $w \in C_r$, $w \in D_s$ with $r \ne s$. Consider the clustering formed
from $\mathcal{D}$ by performing the following in parallel for each $w \in T_{\mathrm{low}}$: if $w \in C_r$ and $w \in D_s$ for
some $r \ne s$, move $w$ to $D_r$. Let $\mathcal{F} = F_1 \cup \cdots \cup F_k$ be the resulting clustering. By construction
$F_i \cap T_{\mathrm{low}} = C_i \cap T_{\mathrm{low}}$ for all $i$, $1 \le i \le k$. Since we only move nodes in $T_{\mathrm{low}}$, clearly $F_i - T_{\mathrm{low}} =
D_i - T_{\mathrm{low}}$ for $1 \le i \le k$. By Lemma 10, $C_i - T_{\mathrm{low}} = D_i - T_{\mathrm{low}}$ for $i \in$ Large. Combining all
these equalities we conclude that $F_i = C_i$ for each $i \in$ Large.






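Now the only extra edges that the clustering $\mathcal{F}$ can get wrong compared to $\mathcal{D}$ are those incident
upon nodes in $T_{\mathrm{low}}$, and therefore

$$\mathrm{disagr}(\mathcal{F}) - \mathrm{disagr}(\mathcal{D}) \le (n - 1) \sum_{w \in T_{\mathrm{low}}} \left(\mathrm{val}^{\mathcal{D}}(w) - \mathrm{val}^{\mathcal{F}}(w)\right). \qquad (7)$$

If a node $w$ belongs to the same cluster in $\mathcal{F}$ and $\mathcal{D}$ (i.e., we did not move it), then since no node
outside $T_{\mathrm{low}}$ is moved in obtaining $\mathcal{F}$ from $\mathcal{D}$, we have

$$\mathrm{val}^{\mathcal{F}}(w) \ge \mathrm{val}^{\mathcal{D}}(w) - |T_{\mathrm{low}}|/(n - 1). \qquad (8)$$

If we moved a node $w \in T_{\mathrm{low}}$ from $D_s$ to $D_r$, then by Lemma 9 we have $\mathrm{pval}^{\mathcal{D}}(w, r) \ge \mathrm{val}^{\mathcal{D}}(w) - 2\beta$.
Therefore for such a node $w$

$$\mathrm{val}^{\mathcal{F}}(w) \ge \mathrm{pval}^{\mathcal{D}}(w, r) - |T_{\mathrm{low}}|/(n - 1) \ge \mathrm{val}^{\mathcal{D}}(w) - 2\beta - |T_{\mathrm{low}}|/(n - 1). \qquad (9)$$

Combining (7), (8) and (9), we can conclude $\mathrm{disagr}(\mathcal{F}) - \mathrm{disagr}(\mathcal{D}) \le (n - 1)|T_{\mathrm{low}}|\left(2\beta + \frac{|T_{\mathrm{low}}|}{n-1}\right)$.
The claim now follows using the upper bound on $|T_{\mathrm{low}}|$ from (5) (and using $n^2/(n - 1)^2 \le 2$). ■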
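
For completeness, the last step can be written out by substituting the middle bound of (5), $|T_{\mathrm{low}}| \le 2k^2\gamma n^2/((n-1)c_1)$, into the combined inequality. This is a sketch of the arithmetic only; it assumes $n \ge 4$, so that $n^2/(n-1)^2 \le 2$.

```latex
\begin{align*}
\mathrm{disagr}(\mathcal{F}) - \mathrm{disagr}(\mathcal{D})
  &\le 2\beta\,(n-1)\,|T_{\mathrm{low}}| + |T_{\mathrm{low}}|^2 \\
  &\le 2\beta \cdot \frac{2k^2 \gamma n^2}{c_1}
     + \frac{4k^4 \gamma^2 n^4}{(n-1)^2\, c_1^2} \\
  &\le \frac{4k^2 \beta}{c_1}\,\gamma n^2 + \frac{8k^4 \gamma}{c_1^2}\,\gamma n^2 .
\end{align*}
```

Adding $\mathrm{disagr}(\mathcal{D}) = \gamma n^2$ to both sides gives the bound stated in part (ii).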

Lemma 12 If the optimal clustering $\mathcal{D}$ has $\gamma n^2$ disagreements for $\gamma \le \frac{c_1}{16k^3}$, then the clustering
ClusMin found by the algorithm makes at most $\gamma n^2 (1 + \epsilon/10)(1 + 4k^2 \beta/c_1 + 8k^4 \gamma/c_1^2)$ disagreements.

Proof: We note that when restricted to the set of all edges except those entirely within $W$, the set of
agreements of the clustering $\mathcal{C}$ in Step 4(e) coincides precisely with that of $\mathcal{F}$. Let $n_1$ be the number
of disagreements of $\mathcal{F}$ on edges that lie within $W$ and let $n_2$ be the number of disagreements on
all other edges. Since $W$ is clustered recursively, the number of disagreements in $\mathcal{C}$ is at
most $n_2 + n_1(1 + \epsilon/10) \le (n_1 + n_2)(1 + \epsilon/10)$. The claim follows from the bound on $n_1 + n_2$
from Lemma 11, Part (ii). ■

Theorem 13 For every $\epsilon > 0$, algorithm MinDisAg($k$, $\epsilon$) delivers a clustering with number of
disagreements within a factor $(1 + \epsilon)$ of the optimum.

Proof: Let OPT $= \gamma n^2$ be the number of disagreements of an optimal clustering. The solution
ClusMax returned by the maximization algorithm has at most OPT $+ \frac{\epsilon^2 c_1^2 n^2}{64 k^4} = \gamma n^2 \left(1 + \frac{\epsilon^2 c_1^2}{64 k^4 \gamma}\right)$
disagreements. The solution ClusMin has at most $\gamma n^2 (1 + \epsilon/10)(1 + 4k^2 \beta/c_1 + 8k^4 \gamma/c_1^2)$
disagreements. If $\gamma > \frac{\epsilon c_1^2}{64 k^4}$, the former is within $(1 + \epsilon)$ of the optimal. If $\gamma \le \frac{\epsilon c_1^2}{64 k^4}$ (which also satisfies the
requirement $\gamma \le c_1/16k^3$ we had in Lemma 12), the latter clustering ClusMin achieves approximation
ratio $(1 + \epsilon/10)(1 + \epsilon/2) \le (1 + \epsilon)$ (recall that $\beta \le \frac{\epsilon c_1}{16 k^2}$). Thus the better of these two
solutions is always a $(1 + \epsilon)$ approximation. ■
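
The case analysis in this proof is plain arithmetic and can be checked mechanically. The sketch below is an illustration under assumptions: it takes $c_1 = 1/20$, $\beta = \epsilon c_1/(16k^2)$, and the case threshold $\gamma = \epsilon c_1^2/(64k^4)$, and the function names are hypothetical. With these values the two $\beta$ and $\gamma$ terms evaluate to $\epsilon/4$ and $\epsilon/8$, and the ClusMax bound equals exactly $1 + \epsilon$ at the threshold.

```python
def clusmin_ratio_bound(eps, k, c1=1 / 20):
    """Ratio bound for ClusMin from Lemma 12, with gamma at the assumed threshold."""
    beta = c1 * eps / (16 * k**2)
    gamma = eps * c1**2 / (64 * k**4)
    return (1 + eps / 10) * (1 + 4 * k**2 * beta / c1 + 8 * k**4 * gamma / c1**2)


def clusmax_ratio_bound(eps, k, gamma, c1=1 / 20):
    """Ratio bound for ClusMax: 1 + eps^2 c1^2 / (64 k^4 gamma)."""
    return 1 + eps**2 * c1**2 / (64 * k**4 * gamma)


for eps in (0.01, 0.1, 0.5, 1.0):
    for k in (2, 3, 4):
        threshold = eps * (1 / 20) ** 2 / (64 * k**4)
        print(eps, k, clusmin_ratio_bound(eps, k), clusmax_ratio_bound(eps, k, threshold))
```

Above the threshold the ClusMax bound only improves, and below it the ClusMin bound only improves, so one of the two solutions is within $1 + \epsilon$ in either case.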

To conclude Theorem 7, we examine the running time of MinDisAg. Step 4 will be run for
$k^{|S|} = n^{O(k^4/\epsilon^2)}$ iterations. During each iteration, the placement of vertices is done in $O(n \log n)$.
Finally, observe that there is always at least one large cluster, therefore the recursive call is always
done on strictly fewer clusters. It follows that the running time of MinDisAg($k$, $\epsilon$) can be described
by the recurrence $T(k, \epsilon) \le n^{O(k^4/\epsilon^2)} (n \log n + T(k - 1, \epsilon/10))$, from which we derive that the
total running time is bounded by $n^{O(100^k/\epsilon^2)} \log n$.
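
The recurrence can be unrolled as follows. This is a sketch rather than a full proof: $a$ is a hypothetical constant standing in for the one hidden by the $O(\cdot)$ in the iteration count, which is assumed to have the form $n^{a k^4/\epsilon^2}$, and the cost of Step 1 is ignored, as in the paragraph above. At recursion depth $j$ the parameters are at most $(k - j, \epsilon/10^j)$, so the accuracy shrinks by a factor of 10 per level and the exponent grows by a factor of 100.

```latex
\begin{align*}
T(k, \epsilon)
  &\le n^{a k^4/\epsilon^2}\bigl(n \log n + T(k-1, \epsilon/10)\bigr), \\
\text{exponent at depth } j
  &\le \frac{a\,(k-j)^4\,100^{j}}{\epsilon^2}, \\
\sum_{j=0}^{k-2} \frac{a\,(k-j)^4\,100^{j}}{\epsilon^2}
  &= \frac{a\,100^{k}}{\epsilon^2} \sum_{m=2}^{k} \frac{m^4}{100^{m}}
   = O\!\left(\frac{100^{k}}{\epsilon^2}\right)
   \qquad \text{(substituting } m = k - j\text{; the series converges)}, \\
T(k, \epsilon)
  &\le k \cdot n^{O(100^{k}/\epsilon^2)} \cdot n \log n
   = n^{O(100^{k}/\epsilon^2)} \log n .
\end{align*}
```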
5 Complexity on general graphs



So far, we have discussed the MaxAgree[$k$] and MinDisAgree[$k$] problems on complete graphs.
In this section, we note some results on the complexity of these problems when the graph can be
arbitrary. As we will see, the problems become much harder in this case.

Theorem 14 There is a polynomial time factor 0.878 approximation algorithm for MaxAgree[2]
on general graphs. For every $k \ge 3$, there is a polynomial time factor 0.7666 approximation
algorithm for MaxAgree[$k$] on general graphs.

Proof: The bound for the 2-clusters case follows from the Goemans-Williamson algorithm for Max
CUT modified in the obvious way to account for the positive edges. The bound for $k \ge 3$ is
obtained by Swamy [15] who also notes that slightly better bounds are possible for $3 \le k \le 5$. ■

We note that in light of the recent hardness result for Max CUT [13], the above guarantee for
MaxAgree[2] is likely the best possible.

Theorem 15 There is a polynomial time $O(\sqrt{\log n})$ approximation algorithm for MinDisAgree[2]
on general graphs. For $k \ge 3$, MinDisAgree[$k$] on general graphs cannot be approximated
within any finite factor.

Proof: The bound for 2-clustering follows by the simple observation that MinDisAgree[2] on
general graphs reduces to Min 2CNF Deletion, i.e., given an instance of 2SAT, determining the
minimum number of clauses that have to be deleted to make it satisfiable. The latter problem
admits an $O(\sqrt{\log n})$ approximation algorithm [1]. The result on MinDisAgree[$k$] for $k \ge 3$
follows by a reduction from $k$-coloring. When $k \ge 3$, it is NP-hard to tell if a graph is $k$-colorable,
and thus even given an instance of MinDisAgree[$k$] with only negative edges, it is NP-hard to
determine if the optimum number of disagreements is zero or positive. ■
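
The proof does not spell out the reduction to Min 2CNF Deletion. One standard way to realize it, assumed here for illustration and not taken from the text, uses a Boolean variable $x_u$ for the side of each vertex $u$: a $+$ edge $(u, v)$ becomes the clauses $(x_u \lor \lnot x_v)$ and $(\lnot x_u \lor x_v)$, and a $-$ edge becomes $(x_u \lor x_v)$ and $(\lnot x_u \lor \lnot x_v)$. Each disagreeing edge then violates exactly one of its two clauses, and each agreeing edge violates none. The sketch below checks this by brute force on a small graph; all names are hypothetical.

```python
from itertools import product


def edges_to_2cnf(labels):
    """Build two clauses per edge; a literal is (vertex, wanted_truth_value)."""
    clauses = []
    for (u, v), sign in labels.items():
        if sign == '+':      # encodes x_u == x_v
            clauses.append(((u, True), (v, False)))
            clauses.append(((u, False), (v, True)))
        else:                # encodes x_u != x_v
            clauses.append(((u, True), (v, True)))
            clauses.append(((u, False), (v, False)))
    return clauses


def violated(clauses, assignment):
    """Number of clauses with no satisfied literal."""
    return sum(
        not any(assignment[var] == wanted for var, wanted in clause)
        for clause in clauses
    )


def disagreements(labels, assignment):
    """Number of edges mislabeled by the 2-clustering given by the assignment."""
    return sum(
        (sign == '+') != (assignment[u] == assignment[v])
        for (u, v), sign in labels.items()
    )


# A triangle of '-' edges plus one '+' edge; this graph is not complete.
labels = {(0, 1): '-', (1, 2): '-', (0, 2): '-', (2, 3): '+'}
clauses = edges_to_2cnf(labels)
best = min(violated(clauses, bits) for bits in product([False, True], repeat=4))
print(best)
```

Since the minimum number of clauses to delete equals the minimum number of violated clauses over all assignments, an approximation for Min 2CNF Deletion carries over to MinDisAgree[2] with the same factor.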